Modelling PBSA Rental Rates

1. Introduction

In this project, we explore modelling Purpose-built Student Accommodation ("PBSA") weekly rental rates in the UK for the AY24/25 by considering a range of factors.

We have a dataset that details the weekly rate being charged for different sub-classifications of room types for different assets across the UK. Our aim is to build a model that can predict the rental rate to be charged for a specific room type in a specific asset based upon the features we have data for.

Other features we have include the city, postcode, operator, tenancy length, and rental information for previous academic years. The dataset also contains somewhat intermittent data for typical room sizes.

We will start by importing the data and refining it, given its sporadic nature. We will then perform some Exploratory Data Analysis ("EDA") to better understand our data, before assessing its suitability for a Linear Regression model, which shall be the first model we try given its simple and interpretable nature. We shall then engineer our features so that the data is ready for training, train the Linear Regression model, and assess its accuracy. Finally, we shall consider two further modelling approaches, a Random Forest and a Gradient Boosting model, before refining a chosen model and evaluating the final, iterated version.

2. Data Cleaning

2.1 Introduction

In this section, we shall import the data and begin cleaning it. We shall look to refine the dataset to ensure it does not include any defunct information, before correcting obvious errors and dealing with missing values. Our aim here is to prepare the dataset so it is ready for EDA. We shall also ensure there is consistency in our data, which shall include checking for duplicates, amongst other things.

2.2 Tidying the DataFrame

We start by importing the necessary modules we shall need for this section.

We now import the data into a Pandas DataFrame and do some initial analysis on the first five rows as well as examine the columns we have and the corresponding data types for each column.
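The import and inspection steps might look like the sketch below. The filename is hypothetical (the source file is not named here), and a tiny stand-in frame is used so the steps can be demonstrated end to end.

```python
import pandas as pd

# The real notebook loads the full dataset; the filename below is hypothetical.
# rental_data = pd.read_csv("pbsa_rental_data.csv")

# Tiny stand-in frame so the inspection steps can be demonstrated:
rental_data = pd.DataFrame({
    "Asset": ["Example House", "Sample Court"],
    "Room Type": ["Studio", "En-Suite"],
    "Weekly Rent": ["£210", "£155"],
})

print(rental_data.head())   # first five rows
rental_data.info()          # column names, dtypes, and non-null counts
print(rental_data.shape)    # (rows, columns) -- (12426, 39) in the real data
```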

As we can see, the DataFrame rental_data comprises $12,426$ rows and $39$ columns. All of the columns, except for the last two, are of the object data type. We also note that each column has differing numbers of non-null values, suggesting that different rows may be missing different parts of data.

We want to work with a smaller version of this DataFrame that only includes the columns we are interested in. We seek to remove all columns with defunct information, such as 'Dataset Issue' and 'ID', as well as patchy information like 'Room Size - Revised with any new intelligence', and information pertaining to other academic years. We will also leave just one geographic variable that we shall consider to avoid multicollinearity. In this case we keep 'city'.

Finally, we drop 'Sub Classification'. This is a useful piece of information that tells us the differing sub-types of room types which results in a range of prices for one room type in any one asset. However, different operators use different classifications, such as 'Bronze', 'Silver', 'Gold' or 'Standard', 'Premium', 'Premium Plus', 'Luxe' etc and this different terminology makes it impossible to truly aggregate this data on a like-for-like basis. So for the purposes of this study, we have elected to remove this level of granularity.

Above we see the names of the columns we have remaining. We now seek to standardise these names.
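The standardisation might be done along the following lines; the exact naming scheme (lower-case snake_case) is an assumption about the convention used.

```python
import pandas as pd

# Illustrative raw column names standing in for the remaining columns
df = pd.DataFrame(columns=["Asset Name", "Room Type", "Weekly Rent 24/25"])

# Strip whitespace, lower-case, and replace spaces/slashes with underscores
df.columns = (
    df.columns.str.strip()
              .str.lower()
              .str.replace(r"[\s/]+", "_", regex=True)
)
print(list(df.columns))  # ['asset_name', 'room_type', 'weekly_rent_24_25']
```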

2.3 Cleaning the Data

We now look to start cleaning the data we have.

We start with room types, where we want to standardise the classifications.

As we can see, there are different variations of the same room types, as well as nonsensical values such as $0$ and $331$.

Below we create a list of what we want to map each of these unique values to, before creating a dictionary of the two lists and using the map() function to amend the data in rental_data.

We have chosen the following classifications: 'Studio', 'En-Suite', 'Non En-Suite', 'One Bed' and 'Twin'. Some clearly missing values are classified as NaN.
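The mapping step described above can be sketched as follows; the raw values and the dictionary entries here are illustrative, whereas the real dictionary covers every unique raw value in the column.

```python
import pandas as pd
import numpy as np

# Illustrative raw room-type values, including nonsensical entries
room_types = pd.Series(["studio", "Studio Apartment", "Ensuite", "en suite",
                        "0", "331", "1 Bed Flat", "twin room"])

# Map each raw value to one of the chosen classifications;
# nonsensical values are mapped to NaN
room_map = {
    "studio": "Studio",
    "Studio Apartment": "Studio",
    "Ensuite": "En-Suite",
    "en suite": "En-Suite",
    "1 Bed Flat": "One Bed",
    "twin room": "Twin",
    "0": np.nan,
    "331": np.nan,
}

cleaned = room_types.map(room_map)
print(cleaned.tolist())
```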

We note that we cannot use rows where this data is NaN. Furthermore, the weekly_rent column for the 'Twin' rooms can be misleading as it sometimes states a value to be paid on a per person basis and other times on an entire room basis. For these reasons, we remove the rows where the room type is one of NaN or 'Twin'.

We now consider the operators and repeat a similar exercise.

We repeat the process for city and note that there are no duplicates and that the names are already standardised.

We repeat the exercise for beds and check the minimum is above 0. We then drop the rows with non-numerical and missing values.
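A sketch of the beds clean-up, using `pd.to_numeric` with `errors="coerce"` on toy values:

```python
import pandas as pd

beds = pd.Series(["350", "120", "n/a", None, "0"])

# Coerce non-numeric entries to NaN, then keep only strictly positive counts
# (the > 0 comparison is False for NaN, so missing values drop out too)
beds_num = pd.to_numeric(beds, errors="coerce")
beds_clean = beds_num[beds_num > 0]
print(beds_clean.tolist())  # [350.0, 120.0]
assert beds_clean.min() > 0
```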

We now turn our attention to build_date.

As we can see, some of the data contains two years in the event of there being a refurbishment. There are also years stated as $201$, $2094$, and $199$, which appear to be incorrect or missing a digit. Below we investigate which assets these years relate to and see if we can manually find the correct year using other sources.

We note that Oak Brook Park has no weekly_rent data and therefore is likely to be dropped before modelling. That leaves Snow Island, 134 New Walk, and X1 The Campus.

External sources reveal that X1 The Campus was completed in 2018, Snow Island completed in 2012, and 134 New Walk was refurbished in 2016. We will amend these in the dataset.

Below we replace the years with accurate ones. We have labelled $201$ as $2018$ to fix the X1 The Campus figure and will independently change Snow Island's build date after. We have also labelled $2094$ as $2016$ for 134 New Walk. In the case of refurbishment, we select the newer date as the date is supposed to represent a proxy for the specification of the property. Ones that could not be deciphered were labelled as np.nan. We note that we have labelled Oak Brook Park as np.nan given the year is inconsequential as the data point will be removed later, given that there is no weekly_rent data.

We now consider the dependent variable - weekly_rent. Below we replace the whitespace-only entries with NaN using regex, before removing the '£' sign and the ',' from the numbers. This allows us to convert the column into a float data type.
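The clean-up of the rent column can be sketched as below, on illustrative values:

```python
import pandas as pd
import numpy as np

rents = pd.Series(["£210.00", "£1,150", "   ", "£95"])

# Whitespace-only strings become NaN...
rents = rents.replace(r"^\s*$", np.nan, regex=True)
# ...then strip the currency sign and thousands separator and cast to float
rents = rents.str.replace("£", "", regex=False).str.replace(",", "", regex=False)
weekly_rent = rents.astype(float)
print(weekly_rent.tolist())
```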

Now we have got all of the variables in the correct data type and have removed the NaN values from the other columns. We now need to deal with the missing values in the weekly_rent column.

We note that much of the missing data refers to room types or tenancy lengths discontinued from previous years and is therefore less damning than it first appears. Given that removing the empty rows still leaves us with $8,000$ data points, we will drop the rows where there is no weekly_rent rather than attempting to interpolate, ensuring we are working with more accurate, if slightly less complete, data.

We note that following the removal of the 'Sub Classification' column, we have a range of weekly rents for each room type for each asset, without the sub-class to aid understanding. In order to simplify the data for modelling, we will need to aggregate this data. Common aggregation techniques include either the mean or the median.

In PBSA assets, it is not uncommon to have one or two incredibly premium rooms at the top of the asset which benefit from typical floor uplift and/or size benefits and/or the presence of balconies or access to exterior space. As a result of this, there will be assets which have outliers in the range of weekly rents for a specific room type. As a result of this, we have decided to calculate the median for each room type on account of it being less prone to being affected by extreme outliers.
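The median aggregation can be sketched as follows on a toy frame (the real groupby would also carry the other asset-level columns):

```python
import pandas as pd

df = pd.DataFrame({
    "asset": ["Alpha House"] * 3 + ["Beta Court"] * 2,
    "room_type": ["Studio", "Studio", "En-Suite", "Studio", "Studio"],
    "weekly_rent": [200.0, 260.0, 150.0, 180.0, 190.0],
})

# Median per asset/room type is robust to the handful of premium penthouse rooms
agg = (df.groupby(["asset", "room_type"], as_index=False)["weekly_rent"]
         .median())
print(agg)
```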

We also recast the data types for asset, room_type, and city as strings.

We now create a unique asset_id for each asset to have something clear to aggregate the data by. We then reorder the columns to bring the asset_id to the front.
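One simple way to assign the id is `groupby(...).ngroup()`, sketched below; since some assets share names, the real key may well combine asset name with city, which is an assumption here.

```python
import pandas as pd

df = pd.DataFrame({
    "asset": ["Alpha House", "Beta Court", "Alpha House"],
    "weekly_rent": [230.0, 185.0, 150.0],
})

# ngroup() assigns one integer id per distinct group key
df["asset_id"] = df.groupby("asset").ngroup()

# Reorder the columns to bring asset_id to the front
df = df[["asset_id"] + [c for c in df.columns if c != "asset_id"]]
print(df)
```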

We now have the cleaned up data with each asset having a standard set of room types which have a single value ascribed for the weekly rent, which reflects the median of all the rents that were being charged for the sub-types of that room type. We now proceed to EDA.

3. Exploratory Data Analysis ("EDA")

3.1 Introduction

In this section we shall consider key summary statistics for each variable as well as visualising the distributions of the data, the relationships between various factors, and assessing and dealing with any outliers that arise.

Below we import further modules we shall need for this section.

3.2 Summary Statistics, High-Level Data Visualisation & Outliers

We start by considering the summary statistics for each variable. This gives us a sense of the size of our data set and allows us to more intuitively understand the data.

From these summary statistics, we immediately notice that there are $1,725$ entries in our data set, reflecting $938$ different assets (with only $921$ unique assets as some share names), $64$ unique cities and $107$ operators. We note that London contains the most assets in our data set, which is not surprising given it contains by far the most Higher Education Institutions ("HEIs") in the UK and the largest student population. Furthermore, we note that UNITE Students operates the most assets in our data set.

Regarding beds, we note that the data appears to be positively skewed and there is large variance in the bed numbers as seen by the standard deviation and the range. build_date seems to be slightly negatively skewed, which is unsurprising given the boom in the PBSA market as an alternative investment class leading to an increase in development to keep up with rising student numbers. Finally, we note that weekly_rent appears to be positively skewed and that there is also a large range with a maximum weekly_rent of $868$ pounds per week.

We now consider the distribution of weekly_rent by examining a histogram and a box plot of the data.
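The two plots can be produced along these lines; the data here is synthetic (a log-normal sample standing in for the real, positively skewed weekly_rent column).

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the script runs anywhere
import matplotlib.pyplot as plt
import numpy as np

# Synthetic positively skewed rents standing in for the real weekly_rent column
rng = np.random.default_rng(0)
weekly_rent = rng.lognormal(mean=5.1, sigma=0.3, size=1000)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.hist(weekly_rent, bins=40)
ax1.set(title="Distribution of weekly_rent", xlabel="£ per week",
        ylabel="Frequency")
ax2.boxplot(weekly_rent, vert=False)
ax2.set(title="Box plot of weekly_rent", xlabel="£ per week")
fig.tight_layout()
fig.savefig("weekly_rent_distribution.png")
```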

As first thought by looking at the summary statistics, the weekly_rent is indeed positively skewed. It is likely we will need to transform this data using a logarithmic transformation in order to prepare it for modelling. We will deal with this at a later stage.

Furthermore, the box plot seems to suggest there are a large number of outliers. We will need to explore this further before deciding how to deal with them. However, at first look, the 'outliers' grouped beyond the maximum whisker are likely due to different cities having wildly different weekly_rents (largely on account of differing sub-market real estate dynamics), rendering entire markets as 'outliers' given the positive skew of the data. An example would be London, which will charge far more than the rest of the UK on account of its increased demand and the competing land uses in the capital rendering developments viable only at higher rents. However, we note that this may not be the case for all of the 'outliers' on the box plot above. Notably, the rent at almost $900$ pounds per week does indeed appear to be a true outlier.

Given this, we consider the distribution and box plot of the weekly_rent on a city-by-city basis.

For some cities, there don't seem to be enough data points to ascertain the shape of the distribution. However, for the larger cities, where there are more data points, there is generally a slight positive skew. Examples include London, Birmingham, Glasgow, and Leeds.

Below we consider the box plots on a city-by-city basis.

Immediately, we notice a lot fewer obvious outliers, which lends credence to our theory concerning the different cities having different markets.

However, there are still some notable exceptions, such as on the Edinburgh, Cardiff, Coventry, Lincoln, and Sheffield plots.

We note that for the cities where there are few data points, the sporadic nature of the data can lead to some points being identified as outliers when they may in fact reflect a reasonable top of the market.

Furthermore, just as considering the data set as a whole produced many apparent outliers because different cities have different rental landscapes, we may be seeing the same effect above, since we are looking at all room types together and the market for one-beds differs from that for non en-suites. Thus, we subdivide further and consider the box plots of each room type on a city-by-city basis.

Between this set of box plots and the box plots by city, we can identify which proposed outliers we think are true outliers. This will comprise the data points that reflect a typo or an error and we will remove them or correct them through additional verification. In order to do this, we will examine the outliers identified between the two sets of plots and use knowledge of reasonable market dynamics to identify the true outliers.

Below we have that iQ Brighton's non en-suite is actually a two-bed apartment. We are unable to confirm whether the price stated is actually for the apartment as a whole and should be split between two or is per person. iQ seems to have a lot of rooms like this and given we cannot confirm the accuracy of the data point, we elect to remove them.

The outlier for Bristol en-suites shows that an asset called King Square Studios has a large two-bed flat being identified as an en-suite. We are unable to confirm an accurate per person per week price and so we remove it for simplicity.

Below we see that Vita Student's Cannon Park asset is an outlier for en-suites in Coventry. This is a result of there being only two data points for en-suites in this asset, priced at $277$ and roughly $978$, with the median taking the halfway point. Given the $978$ is a three-bed flat, we will correct this to take just the value of the $277$ cluster, as advertised by their website.
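A targeted correction of this kind can be applied with a boolean mask and `.loc`, as sketched below; the frame is a toy reconstruction with illustrative values, not the actual dataset.

```python
import pandas as pd

df = pd.DataFrame({
    "asset": ["Cannon Park", "Cannon Park"],
    "room_type": ["En-Suite", "Studio"],
    "weekly_rent": [627.5, 300.0],  # 627.5 = distorted median of 277 and ~978
})

# Overwrite the distorted median with the advertised cluster price
mask = (df["asset"] == "Cannon Park") & (df["room_type"] == "En-Suite")
df.loc[mask, "weekly_rent"] = 277.0
print(df)
```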

We also note that the one-bed market has some questionable values in Coventry.

We note that the asset at 33 Parkside is actually offering two-bed apartments at that price. We remove them as we are unable to ascertain the true per person price. The Infinity asset also has an outlier, but we note that Novel Student is an incredibly premium operator and can charge those prices as it reflects the service offering they have. We remove the 33 Parkside entry but leave the Infinity entry below.

We also note that Burges House has outlying values for its studios and en-suites which do not agree with one another, with the studio being an outlier in the market. For this reason, along with the fact that we have a lot of data points for Coventry anyway, we remove all Burges House entries.

Again the Vita Student asset contains a three-bed that is being marketed at $868$ per week for the entire apartment. We replace this with $\frac{868}{3} = 289.33$.

Riverside House in Guildford has dual occupancy studios that are being classified as en-suites. We remove as they are an inconsistency.

iQ Leeds in Leeds is another example of a mislabelled two-bed apartment that we cannot verify. We remove it.

Similarly, the non en-suite at iQ Pavilions is actually a two-bed apartment and so we remove it for similar reasons.

The YPP Gravity Residence asset in Liverpool contains a two-bed that is being marketed at $363.46$ per week for the entire apartment. We replace this with $\frac{363.46}{2} = 181.73$.

We note that the Chapter non en-suites in London above $300$ are actually two-bed and three-bed apartments and so we remove them as we cannot verify the per person rental charge.

We note that the Luna Hatfield asset is an outlier for studios. This is a result of it being in Hatfield but being classified as London. Whilst there are other areas around London, such as Egham, that clearly benefit from the proximity to the capital and whose rental tones reflect as such, this asset in Hatfield is further away and does not appear to be in line with what we would expect. Furthermore, it is the only asset in our data set in Hatfield, so we are reluctant to reclassify the assets to a new city. We therefore elect to remove it.

The en-suites at Apex Heights are actually two-bed en-suite clusters priced at $115$ per week per person. We revalue them as such.

Whilst it has not been classified as an outlier, there appears to be another iQ non en-suite in Manchester at Kerria Apartments far above the rest of the market. Further research reveals these are in fact two-beds and they are removed for reasons as above.

Again the Vita Student asset contains two-beds that are being marketed at $405$ and $446$ per week for the entire apartment. We replace these with half their median: $\frac{425.50}{2} = 212.75$.

This asset in Nottingham contains a two-bed that is being marketed at $405$ per week for the entire apartment. We replace this with $\frac{405}{2} = 202.5$.

We note that the singular Oxford non en-suite is obviously incorrect and actually a one-bed flat. We relabel it below and then recalculate the median price for a one-bed at West Way Square as there is already that classification.

The Friargate Court asset in Preston is achieving far above the rest of the market to an unrealistic extent. It is also unclear whether it is solely student accommodation as it appears to be marketed as residential for working people too. Therefore, we remove it to maintain consistency.

Sheffield purportedly has a few outliers. We start by examining the studios above $230$ per week to better understand the top of the market. We note that the three outliers as per the box plot are all above $250$.

The assets achieving a premium are not ridiculously above the rest of the market and we note that Hillside House, which is achieving the highest rents in the market, is operated by Novel Student, which are known for being a premium PBSA operator.

Again the Vita Student asset contains a two-bed and a three-bed that is being marketed at a price for the entire apartment. We replace this with $174.17$, which is the true median.

Similarly, Sovereign Newbank House is being marketed for the entire two and three bedroom flats. We correct it to the true median.

Given the established price of studios, the price for one-beds seems reasonable within the context of the market in Sheffield at the upper end, but we note there is an extremely cheap one-bed at iQ Steel. Further examination shows this is double occupation and the price we have recorded is for one person. We double it to attain an accurate sense of the real price.

This Stanley Studios non en-suite is effectively a high-end two-bed flat and so is inconsistent with what we want to capture by non en-suite. For that reason, we remove it.

These en-suite and non en-suite rooms in Dunn House are actually dual occupancy studios and so we remove them.

iQ Fiveways House in Wolverhampton has another difficult iQ two-bed apartment. Again, we cannot ascertain whether the price listed is for the entire apartment or per person. We remove it for consistency.

3.3 Visualising the Features

We now examine the weekly_rent in comparison to the categorical variables, starting with the distribution of weekly_rent by room_type.

As we can see, each room type showcases that same positive skew we saw in the wider data set. We now plot the box plots for the weekly_rent by room_type.

As expected, we see that one-beds have the highest median, followed by studios, en-suites, and then non en-suites. We note that the outliers, in a similar fashion to when we plotted the larger data set, are as a result of there being different regional sub-markets pertaining to individual cities.

We now examine the frequency of the different build dates.

We note that build_date is indeed negatively skewed, as first thought from considering the summary statistics. Again, this makes sense given the greater exposure investors are obtaining to PBSA of late, with developers incentivised to build PBSA for a 'hot' market and the strengthening fundamentals of rising UK student numbers supporting this decision in recent years.

It appears that the 'boom' in PBSA development occurred between 2014 and 2020, with the number of private PBSA assets in our data set that have opened in the last few years decreasing. Perhaps this is as a result of student growth slowing down and untapped demand decreasing, given the explosion in development, leading to developers looking to other asset classes. Alternatively, the slowdown in development could be a response to the stricter building regulations post-Grenfell disaster.

Irrespective of the underlying causes, the data lies within the expectations we would have for the distribution of the build dates.

We now consider the distribution of the bed numbers.

Here we can see a clear positive skew, with most assets being between $0$ and $300$ beds in size. This data will likely need to be transformed before modelling.

We now examine the frequency of different categorical factors in order to further understand the data. We do so by examining the number of beds by category.

As expected, and as revealed by the summary statistics, London has the most PBSA beds in the UK, with other large cities not far behind. We note that the number of PBSA beds in London is almost double that of the next city - Sheffield, which highlights how big the London PBSA market is.

Here we see that UNITE Students is by far the largest operator of PBSA in the UK with over $40,000$ beds, according to our data set. We note that there are a few very large operators, such as UNITE, iQ, CRM, Student Roost, and the various Homes For Students brands, with a long tail of very small operators, who likely operate a few assets or less on a local, rather than national, scale.

We further consider all operators with more than $5,000$ beds to get a clearer picture of the largest operators in the market.

We note that UNITE Students offer a more basic offering as an operator and have amassed scale at reasonable rents with regard to the surrounding sub-markets. In contrast to this, operators such as Vita Student have considerable portfolios of scale offering some of the best services in the market, and therefore are likely to achieve rental premiums to the surrounding markets as a result of their heightened service offering.

3.4 Conclusion

We now have a clearer picture of some of the key features of the data and have ensured accuracy by removing or correcting the offending outliers. We note that from a quantitative perspective, both weekly_rent and beds are positively skewed and likely to require transformations. On the other hand, build_date has a negative skew, which may also need transformation.

We have also considered the different sub-market dynamics by examining the distribution of weekly_rent by city and ascertaining that where there is scale, there is also a positively skewed distribution. Finally, we have considered some of the categorical variables to reveal that the city with the most beds is London and that UNITE Students operate the largest PBSA portfolio in the UK.

Given the first model we wish to attempt to fit to our data is a Linear Regression model, in the next section we consider the suitability of our data set for such an application.

4. Suitability of Linear Regression

4.1 Introduction

In this section, we will verify that a Linear Regression is indeed a suitable model to apply to this data. We will do so by examining the relationship between the different variables and weekly_rent to see if there is indeed an underlying linear relationship.

4.2 Numerical Features

We start by examining the relationship between weekly_rent and the other numerical data.

Above we can see that there does indeed appear to be a slight positive relationship between build_date and weekly_rent. This suggests that newer buildings tend to achieve higher rents, which makes logical sense if we consider that newer buildings are more likely to have newer, more modern specification, increased amenity space, and benefit from increased curb appeal.

Here we see that the number of beds appears to have a very slight positive correlation with weekly_rent. However, the correlation is so small, it appears to be effectively zero, suggesting that there is no correlation between the size of an asset and the rent it commands.

Whilst some may argue that a larger asset is more likely to have more amenity space, rendering a scheme more attractive, there are other dynamics at play. Firstly, this is not always the case. Secondly, albeit there may be more amenity space, that amenity space is shared between a larger group of people thus reducing the attractiveness of that amenity space. Thirdly, larger schemes are more likely to have a wider array of rooms with a bigger mix of clusters and studios, thus having an increased range of weekly rents, with cheaper rooms bringing down any average obtained from an increase in amenity space. This results in the fairly absent correlation we see above.

Given this lack of relationship, we may consider dropping the variable from our model as it appears to contribute little. Before making any decisions, we further consider whether there is a relationship between the logarithm of beds or, alternatively, if grouping the assets by bed numbers shows some sort of ordinal relationship.

Here we see there is still very little correlation, further supporting dropping the variable if models do not deem it significant.

Above we have classified the assets into 'sub-scale', 'normal', 'large' or 'oversized' categories based upon the number of beds. Still we see very little relationship, if any, adding further credence to dropping the variable. We shall keep the variable in for now and assess its impact on the model once we are evaluating it, perhaps removing it at a later date.

Below we create a correlation matrix between the numerical variables to assess the strength of the linear relationships.
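The matrix can be computed with `DataFrame.corr()`, sketched here on synthetic data with a deliberately mild build_date effect and unrelated bed counts, purely for illustration:

```python
import pandas as pd
import numpy as np

rng = np.random.default_rng(1)
n = 200
build_date = rng.integers(1995, 2024, n)
beds = rng.integers(50, 800, n)
# Rents with a mild positive build_date effect plus noise (synthetic)
weekly_rent = 150 + 1.5 * (build_date - 1995) + rng.normal(0, 40, n)

df = pd.DataFrame({"build_date": build_date, "beds": beds,
                   "weekly_rent": weekly_rent})
corr = df.corr(numeric_only=True)
print(corr.round(2))
```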

As we can see, there is indeed a weak positive correlation between build_date and weekly_rent of $0.23$, insinuating that there is some evidence to suggest that newer buildings charge a premium.

Furthermore, a correlation of 0.07 between beds and weekly_rent suggests no meaningful correlation. Therefore beds may not significantly contribute to the model.

4.3 Categorical Features

We now turn our attention to the categorical features - namely room_type, operator, and city.

Here we see that the categories show a clear separation from one another, especially in the case of one-beds, studios, and en-suites. We note that non en-suites and en-suites have a fairly similar profile; however, the latter still has slightly higher quartiles.

Above we have plotted box plots for the weekly_rent by operator for the $20$ largest operators in the UK. We have selected only these operators as plotting all $100+$ operators would be impractical to draw conclusions from, and many of the smaller operators only manage a few assets, meaning the data is low volume and therefore less insightful. Furthermore, by selecting the $20$ largest operators, we are accounting for the data that is most common in our data set and therefore most likely to be influential.

We note that, even amongst our subset of operators, we can see differences in the ranges and medians achieved. Some of these differences will be explainable by the sub-markets these operators are active in. For example, Chapter London is a brand that exclusively manages assets in London, which explains why their achieved rental tone is so consistently above other operators. UNITE and iQ, on the other hand, operate thousands of beds on a national scale which leads to a larger range at a lower price point on average.

Despite this, and considering that the majority of these operators have assets across the UK, which somewhat mitigates the influence of sub-market dynamics, we can see that there are distinctions in rental tone amongst operators. For example, UNITE and Homes For Students typically provide quite a basic level of service and this is reflected in their lower median and minimum weekly_rents. On the other hand, the aforementioned Chapter London, Scape Student Living, and Vita Student are known for providing a premium service at the top end of the market, which we also see reflected here. This distinction between operator provides credence for the application of a Linear Regression Model.

Similarly to operators, there are too many cities to consider all at once. However, we have looked at the $30$ most supplied cities by beds. Again, this provides the most data points, avoids being too congested to draw conclusions, and benefits from providing insight into the most influential data.

Once again, we see clear regional differences supporting that this variable could be meaningful for a Linear Regression model. Naturally, London sees the highest median weekly_rent with the highest maximum value too. This is to be expected on account of the competing land uses in the nation's capital driving up land prices and impacting rental rates too. This is further compounded by London having nearly $40$ HEIs and therefore increased demand for PBSA.

Furthermore, other chronically undersupplied cities that see higher rents across all residential sectors, such as Bristol and Edinburgh, also see higher rents achieved here. Bristol's undersupplied nature, resulting from draconian planning laws, has driven PBSA rents up immensely over the last few years, which we see with the median weekly_rent approaching that of London. This influence of supply-and-demand dynamics is seen further in Glasgow, which benefits from five universities and one of the largest student populations in the UK, leading to attractive supply-and-demand dynamics which in turn drive rents, as reflected here.

This plot highlights clearly that there are sub-market dynamics within the UK and this could contribute towards effectively training a Linear Regression model.

Finally, we calculate and compare group medians for the categorical data.
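The group medians can be computed with a simple groupby, sketched here for room_type on toy values (the same pattern applies to operator and city):

```python
import pandas as pd

df = pd.DataFrame({
    "room_type": ["Studio", "Studio", "En-Suite", "One Bed"],
    "weekly_rent": [230.0, 250.0, 160.0, 290.0],
})

# Median weekly rent per room type, highest first
medians = (df.groupby("room_type")["weekly_rent"]
             .median()
             .sort_values(ascending=False))
print(medians)
```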

Here we can see notable differences between the different classifications of room_type, which further suggests that the variable could be used to train a Linear Regression model.

Again, we see plenty of variation between the median achieved rents. In some cases, this variation is rather stark, again supporting that this factor may be influential.

Again, we see that some cities, such as London, Bristol, and Edinburgh, achieve high median weekly_rents. This stands in comparison to other cities, such as Carlisle, Bolton, and Bradford, who achieve much lower rates. This lends itself to the assumption that regional factors can influence rental rates and be significant predictors in a Linear Regression model.

4.4 Conclusion

In conclusion, we have established that a Linear Regression model would be a reasonable approach. Our analysis revealed that there is a weak positive correlation between build_date and weekly_rent, suggesting that newer buildings charge a premium. We also found there to be no significant correlation between the size of an asset and the weekly_rent, suggesting that this feature will be broadly insignificant for training a model. However, analysis of the categorical variables of room_type, operator, and city suggested that there are clear differences between these categories, which in turn supports that they will be influential in training a linear model.

We now consider the transformations and encoding needed to prepare this data for modelling.

5. Feature Engineering

5.1 Introduction

In this section, we shall consider each of the variables and ensure it is ready for the model to be trained upon it. This will involve considering any transformations or encoding that needs to take place and the rationale behind said changes.

Before we transform or encode our data, we are going to split it into a training set and a test set. Splitting our data provides us with the ability to test our trained model on a completely unseen data set. We perform the split now to reduce data leakage and ensure any transformations we make to the data are done independently of the test set, which should be treated as a completely unseen set of data.
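A sketch of the split with scikit-learn's `train_test_split`; the frame, the 80/20 split, and the random seed are illustrative assumptions.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.DataFrame({
    "build_date": [2010, 2015, 2018, 2020, 2005, 2012, 2019, 2021],
    "beds": [300, 150, 450, 200, 600, 120, 350, 500],
    "weekly_rent": [180.0, 210.0, 240.0, 260.0, 160.0, 190.0, 250.0, 270.0],
})

X = df.drop(columns="weekly_rent")
y = df["weekly_rent"]

# Split before any transformation so the test set stays truly unseen
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
print(X_train.shape, X_test.shape)
```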

5.2 Numerical Features

We start by considering the numerical data variables - build_date and beds on the independent variable side and weekly_rent on the dependent variable side. We start with weekly_rent.

As we saw in the distribution of weekly_rent in Section 3, the weekly_rent data is quite positively skewed. We check this for the training data below.

Clearly, we have maintained that positive skew following the splitting of the data. We therefore elect to apply a log transformation, as this will compress the data, reducing the effect of the extreme outliers that give the data its positively skewed shape. Since none of the rents are $0$, we apply a simple transformation $f(x)$, where $f(x) = \log(x)$.
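A minimal sketch of the transformation, using hypothetical rent values with one London-style extreme:

```python
import numpy as np
import pandas as pd

# Hypothetical positively skewed weekly rents (one extreme value)
y_train = pd.Series([150.0, 210.0, 130.0, 950.0, 180.0])

# Rents are strictly positive, so f(x) = log(x) is well defined
log_y_train = np.log(y_train)

# The log compresses the right tail, reducing the skew
skew_before, skew_after = y_train.skew(), log_y_train.skew()
```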

Clearly, the data is now more normal and suitable for Linear Regression.

Similarly, we had that the beds feature was positively skewed. We consider the distribution below for just a unique set of assets to ensure we are not double counting anything.

Again, we see that beds is positively skewed. We note that beds is actually a discrete variable and not continuous. Therefore, traditionally we would encode it using one of One-Hot Encoding ("OHE"), Target Encoding, or Ordinal Encoding, which we explain in more detail below.

However, given the size of the data set, the range of the variable, and the evidenced skewness, we are going to treat it as a continuous variable and apply a log transformation.

Again, this distribution is slightly better; however, it is still possible that this variable will be dropped, as it is unlikely to influence the model, given the work we did in Section 4.

Finally, we consider the build_date, which is also a categorical variable. We elect to turn this into a continuous variable, 'age', so that the model can more easily interpret the jumps between years.

We now consider the distribution of the age variable.

Here we can see the data shows a slight positive skew. Furthermore, as we have now reframed this data as continuous, we can treat it as such and apply a log transformation as above to reduce the skewness.

We note that there are some assets with age $0$ and thus use the transformation $f(x) = \log(x+1)$ as $\log(0)$ is not defined.

We notice that the distribution is quite sparse now and actually appears to be slightly negatively skewed. This suggests the original distribution was not that positively skewed and thus the log transformation is too severe. We consider a slightly less aggressive square root transformation where $f(x) = \sqrt{x}$ instead.
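The age derivation and the two candidate transformations can be sketched as follows; the build years are hypothetical, and 2024 is assumed as the reference year for AY24/25.

```python
import numpy as np
import pandas as pd

# Hypothetical build years; AY24/25 makes 2024 a natural reference point
build_date = pd.Series([2024, 2020, 2015, 2005, 1995, 2022])
age = 2024 - build_date

# f(x) = log(x + 1) would handle age == 0 (log(0) is undefined), but
# proved too aggressive here, so the gentler square root is used instead
log_age = np.log1p(age)
sqrt_age = np.sqrt(age)
```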

We note that whilst this isn't perfect, it has normalised the data somewhat. We also note that our work in Section 4 already highlighted that there was a weak linear relationship between the build_date variable and the weekly_rent, which should be understood by the Linear Regression model.

5.3 Categorical Features

Now that we have transformed the numerical features, we turn our attention to the categorical features, starting with room_type. Given it is a categorical variable, we need to encode the data. One type of encoding is OHE, where we create $n$ columns (or $n-1$) for the $n$ categories in our variable, with a $1$ if the row is of that category and a $0$ otherwise. This captures all the granularity of the data, albeit it can increase dimensionality by adding extra columns. Another option is Ordinal Encoding, where we assume an order and assign a number $1$ through $n$ to each category. A third option is Target Encoding, where we replace the categorical label with a number derived from the target, such as the mean or median of that category. This retains a level of information about the variable while keeping dimensionality lower than OHE.

Whilst the box plots created earlier show that there is indeed an ordinal pattern to the variable room_type, with non en-suites having the lowest median weekly_rent and one-beds having the highest, we note that there was not an extremely clear distinction between non en-suites and en-suites. Furthermore, given that there are only four categories for this variable, we will not be increasing the dimensionality too much if we use OHE. For those reasons, we will use OHE for the room_type variable.
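A minimal sketch of the OHE step with pandas; the category labels below are illustrative, and dropping the first (alphabetically, En-Suite) makes it the implicit base case.

```python
import pandas as pd

room_type = pd.Series(
    ["En-Suite", "Non En-Suite", "Studio", "One-Bed", "En-Suite"],
    name="room_type",
)

# drop_first=True keeps dimensionality at n - 1 columns; the dropped
# category becomes the base case represented by all-zero rows
dummies = pd.get_dummies(room_type, prefix="room_type", drop_first=True)
```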

We now consider the city variable. We note that there are $63$ different cities in the training set and that many cities do not have many data points, as seen from the bar chart in Section 3.

Again, we have similar encoding options available to us as for room_type. There are too many unique values here to use OHE directly, and whilst we could piece together some ordinal data, Target Encoding would be quicker, making it a viable option. Alternatively, we could reduce the dimensionality and still use OHE by assigning each city to one of $12$ regions in the UK, namely: Scotland, Northern Ireland, Wales, North West, North East, Yorkshire, West Midlands, East Midlands, East, South West, London, or South East. That would leave us with $12$ categories, which after dropping the first column results in an increased dimensionality of $11$ columns, which is broadly acceptable.

Both methods have their merits. Target Encoding enables us to maintain our city-level granularity and ensures a lower dimensionality in comparison to OHE. However, given that $13$ cities have five or fewer samples, there is a chance that the model overfits and does not generalise well. On the other hand, OHE smoothens the data and relies upon the assumption that cities in similar regions of the UK achieve similar rents, which we can see from the box plot in Section 4 is not always the case. For example Edinburgh and Aberdeen would both be grouped into the Scotland category, but clearly have different rental distributions.

Given this difference within regions, we will use Target Encoding. However, we will smooth out the city medians calculated with the UK median from the data set, in an attempt to avoid overfitting. We start by calculating both the city_median and the UK_median, before then weighting the two to have a value to encode. We note that the weekly_rent has been log transformed and so this will affect these calculations.

We now wish to smooth these median rents to account for small sample sizes. We shall create a function that takes the UK median and the city median and weights the two by a weighting influenced by the size of the sample we have for an individual city. We shall then append this information to our training data.
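A smoothing function along these lines might look like the sketch below. The shrinkage weight $n/(n+k)$ and the value $k=10$ are assumptions for illustration, not the notebook's actual parameters.

```python
def smooth_medians(city_median, uk_median, n, k=10):
    """Shrink a sparsely sampled city's median towards the UK median.

    The weight n / (n + k) approaches 1 for well-sampled cities and 0
    for cities with few rows; k (assumed to be 10 here) sets the
    strength of the shrinkage.
    """
    weight = n / (n + k)
    return weight * city_median + (1 - weight) * uk_median

# A city with only 2 samples is pulled strongly towards the UK median...
sparse = smooth_medians(city_median=6.0, uk_median=5.0, n=2)
# ...whereas a city with 200 samples keeps its own median almost intact
dense = smooth_medians(city_median=6.0, uk_median=5.0, n=200)
```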

We now have an encoded city variable for the training data.

Our final categorical variable to encode is the operator variable. Similarly to city, we cannot simply OHE this variable, as there are over $100$ different operators and this would add too much dimensionality. This leaves two options, as with city: we can either group the operators into categories and OHE this lower-dimensionality variable, or we can use Target Encoding in a similar manner to the above.

Irrespective of the method we choose, we need a way of classifying the operators numerically. One way we could achieve this is by considering the median weekly_rent achieved by each operator. This would work, however there are some operators, such as Chapter London and Urbanest, who have London-only PBSA brands. This would see them achieve much higher rents than a regional brand of equal operational prestige, solely on account of the city. Therefore, we need to account for this. Similarly, we need to account for the room_type as an operator with solely one-beds will achieve a higher rent than an operator with just non en-suites, even if they are operating in the same city.

Therefore, to isolate the effect an operator is having on the weekly_rent being achieved, we need to calculate some sort of premium on a city and room_type level for each operator. We can do so by comparing the median weekly_rent achieved by an operator in a certain city for a certain room_type to the median weekly_rent achieved by all operators across that city and room_type. This gives us a multiplicative factor that reflects whether the operator is lifting the weekly_rent (premium $\gt 1$) or having a negative effect on it (premium $\lt 1$).
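The premium calculation can be sketched with a toy slice of (log-transformed) rents; the operator and city names below are hypothetical.

```python
import pandas as pd

# Toy slice of (log-transformed) training rents to illustrate the idea
df = pd.DataFrame({
    "operator": ["Op A", "Op B", "Op A", "Op B"],
    "city_name": ["Leeds"] * 4,
    "room_type": ["En-Suite"] * 4,
    "weekly_rent": [5.2, 4.8, 5.4, 4.6],
})

# Median rent each operator achieves within its (city, room_type) cell
op_median = df.groupby(
    ["operator", "city_name", "room_type"]
)["weekly_rent"].transform("median")

# Median rent achieved by all operators across that (city, room_type) cell
market_median = df.groupby(
    ["city_name", "room_type"]
)["weekly_rent"].transform("median")

# > 1 means the operator lifts rents; < 1 means it drags them down
df["premium"] = op_median / market_median
```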

Below, we calculate the premiums for each operator on a city and room_type level. We start by adding back in the city_name column and the room_type column as well as the weekly_rent data in order to produce the calculations.

We have the following average premiums by operator.

We now plot these premiums to get an idea for the distribution of this feature.

We note that, given the premiums were calculated using the already log-transformed data, the distribution is broadly normal, albeit with a fatter right tail than we would expect.

Before we map this data onto the operator value to complete the Target Encoding, we wish to smooth out the encoded values using a similar approach to how we encoded the city variable.

In order to do this, we require a UK average premium across all operators. We will then weight the premium depending upon the sample sizes for each operator.

As we would expect, the median premium is $1$, which reflects no premium or discount at all. We now map on the sample sizes for each operator before using the smooth_medians function defined earlier to weight the encoded values and smooth out extremities.

As we can see, the smoothed operator premiums depict a slightly more normal distribution, albeit still with fatter tails.

We now map the smoothed_premiums onto the transformed_X_train data set and tidy up the DataFrame.

Finally, we standardise all of the variables to allow for consistent coefficient comparisons in Section 6.
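The standardisation step might look like the sketch below; the feature names and values are hypothetical. The key point is that the scaler is fitted on the training data only.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical transformed training features
transformed_X_train = pd.DataFrame({
    "beds_log": [4.6, 5.5, 4.4, 6.0],
    "age_sqrt": [0.0, 2.0, 3.0, 1.0],
})

# Fit the scaler on the training data only; the same fitted scaler is
# reused on the test set later to avoid data leakage
scaler = StandardScaler()
standardised = pd.DataFrame(
    scaler.fit_transform(transformed_X_train),
    columns=transformed_X_train.columns,
)
```

After standardisation every column has zero mean and unit variance, which is what makes the coefficient comparison in Section 6 meaningful.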

As we can see above, we have both transformed_X_train and transformed_y_train ready for the model, with all data having been transformed and encoded, as required. All of the data has then been standardised.

We now proceed to repeat these encoding, transformation, and standardisation steps with the test data before training and assessing a Linear Regression model.

6. Modelling & Model Optimisation

6.1 Introduction

In this section, we begin by transforming our test data in the same manner as we transformed our training data to ensure the data is ready to be used for modelling. We note this will be using transformations as influenced by the training data only to minimise data leakage.

We then train a Linear Regression model on the training data before applying it to our test data. Finally we evaluate the model and consider some further modelling approaches before iterating these models to achieve a model with the best performance possible.

6.2 Preparing The Test Set

We start by applying the logarithm transformations to both weekly_rent in y_test and the beds variable in X_test.

Next, we calculate the age variable and transform it with the $f(x) = \sqrt{x}$ transformation, before dropping the build_date variable.

We now OHE the room_type, dropping the first column to maintain a lower dimensionality.

We now target encode both city and operator, making sure to use the encoding values from the training set. Any NaN values for cities or operators not in the training set are filled with the global UK medians, calculated across both cities and operators, respectively on the entire training set.
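The fallback for unseen categories can be sketched as follows; the encoding values, city names, and UK median below are hypothetical.

```python
import pandas as pd

# Hypothetical encoding values learned from the training set
city_encoding = {"Leeds": 5.10, "Bristol": 5.40}
uk_median = 5.20  # global fallback computed on the training set only

X_test = pd.DataFrame({"city_name": ["Leeds", "Truro"]})  # Truro unseen

# Unseen categories map to NaN and fall back to the UK-wide median
X_test["city_encoded"] = X_test["city_name"].map(city_encoding).fillna(uk_median)
```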

Finally, we drop the asset_id column.

We now have a fully transformed X_test and y_test, ready for modelling.

6.3 Linear Regression Model

6.3.1 Training the Model

We now fit a Linear Regression model on our data and use evaluation metrics, such as the Mean Square Error ("MSE") and the $R^{2}$ score, to get a sense of the accuracy.
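The fit-and-score pattern can be sketched on synthetic data standing in for our engineered features; the coefficients and noise level below are arbitrary.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Synthetic stand-in for the engineered, standardised features
rng = np.random.default_rng(0)
X_train = rng.normal(size=(100, 3))
y_train = X_train @ np.array([0.5, 0.2, -0.1]) + rng.normal(scale=0.1, size=100)

model = LinearRegression().fit(X_train, y_train)
pred = model.predict(X_train)

mse = mean_squared_error(y_train, pred)
r2 = r2_score(y_train, pred)
```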

6.3.2 Evaluating the Model

We write the below as a function to expedite our iterations of the model as we try and evaluate and compare different approaches.

Here we see that the MSE for the training set is only $0.0197$ to four decimal places, which is low and shows that the model is fitting well to the training data and has a high degree of accuracy. We note that the test data has an MSE of $0.0241$ to four decimal places, which is higher than the training set but still low. The MSE of the test set being close to that of the training set suggests that the model generalises well and is not overfit to the training data. We note that we expect the MSE to be higher for the test data than the training data as the model will always marginally overfit to the data it is trained upon but the small change is a positive sign.

An $R^{2}$ score of $0.8544$ and $0.8280$ for the training set and test set, respectively, is also a very positive sign. This suggests that ~$83\%$ of the variance in the weekly_rent variable is explained by our independent variables in our model. The remaining ~$17\%$ is explained either by variables we do not have or by inherent variation. An $R^{2}$ score of ~$83\%$ is high and suggests that the Linear Regression model is indeed suitable, especially given the context of the data as being in the socially-influenced market of real estate rental rates, where factors such as marketing can play a significant role.

We now consider the Root Mean Square Error ("RMSE") to derive further measures of accuracy. We note that we are considering both the RMSE and the values of weekly_rent in their log-transformed states, which reduces interpretability but is acceptable for the point of comparison.

We note that the RMSE represents ~$2.61\%$ of the average weekly_rent for the training set and ~$2.89\%$ for the test set. This suggests that the relative average error of the two sets is quite low, again providing support for the efficacy of the model.

Furthermore, the RMSE as a proportion of the range of the weekly_rent variable is ~$6.20\%$ and ~$7.61\%$ for the training and test set, respectively. This suggests the RMSE is a small proportion of the target range, again suggesting the model performance is good.
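The arithmetic behind the relative-error figures above can be sketched as follows; the mean log rent used is a hypothetical value back-implied from the reported percentages, not taken from the dataset.

```python
import numpy as np

mse_train, mse_test = 0.0197, 0.0241
rmse_train, rmse_test = np.sqrt(mse_train), np.sqrt(mse_test)

# Mean of the log-transformed weekly_rent (hypothetical, back-implied)
mean_log_rent = 5.38

rel_err_train = rmse_train / mean_log_rent  # ~2.61%
rel_err_test = rmse_test / mean_log_rent    # ~2.89%
```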

6.3.3 Residual Analysis

We now consider the residuals of the model.

As we can see, the residuals are fairly random and scattered around zero, as we would expect. The residuals appear to be fairly homoscedastic with approximately equal variance in the residuals as we move up and down the x-axis. We note that there are some outliers, for example the two points in the top right and some points near the bottom, which we may decide to rectify later.

We now consider the distribution of the residuals below.

As we can see, the residuals are normally distributed around zero, albeit we note there are some outliers in the right and left tails. We examine the QQ plot of the residuals below.

We can see that there is a strong linear pattern but note that there are two outliers towards the upper right of the plot as well as some outliers towards the left tail.

6.3.4 Dealing With Outliers

We will investigate the outliers further to decide whether they need to be removed or not. We start by calculating the absolute standardised residual and considering those with a score above three as outliers.
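The flagging rule can be sketched on synthetic residuals; the two extreme values are planted deliberately so the three-standard-deviation test has something to catch.

```python
import numpy as np

rng = np.random.default_rng(1)
residuals = rng.normal(scale=0.15, size=500)
residuals[:2] = [0.9, -0.8]  # plant two extreme residuals for illustration

# Absolute standardised residuals; anything above three is flagged
z = np.abs((residuals - residuals.mean()) / residuals.std())
outlier_idx = np.where(z > 3)[0]
```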

There are four residuals which, when standardised, pass our test for being considered outliers. In order to examine these further, we look at these rows in the original training set, prior to all our transformations.

As we can see, there are four outliers, identified through both the residual plots and the standardised residual test. Immediately we see that iQ-operated assets comprise three of the four outliers. This is unsurprising, as iQ has one of the largest bed counts in the country and operates in a vast majority of UK markets. iQ therefore has a very large range of achieved weekly_rents, which has likely muddied the influence of the operator feature.

The first outlier is for non en-suite rooms in the Unest asset in Carlisle. We note that the data is accurate and that this outlier status is likely a result of Carlisle having an incredibly low supply of PBSA, on account of there being very little demand for it there. Furthermore, due to Carlisle's inherently low-value real estate market, rents for any PBSA asset in Carlisle will be low in comparison to other cities, marking the city as one of the extremities on a national scale. This too is likely to contribute to its outlier status. We are not concerned with the data here and leave this entry as it is.

The second outlier corresponds to the studio rooms in iQ Castings in Huddersfield. We note that this has been verified as an accurate price and is in line with iQ's other asset Little Aspley House in Huddersfield. We leave this as it is.

The final two outliers correspond to the en-suite room type in two different iQ assets in London. We note that these two assets are iQ Bloomsbury and iQ Hammersmith, respectively. Both assets are in incredible locations in London, notably Bloomsbury, which is directly adjacent to numerous universities. Given this, it appears that these rates, whilst high, are indeed accurate and a byproduct of the extremely desirable sub-markets within London that these assets are situated in. Given this, we will leave these two data points in the data set.

6.3.5 Coefficients & Collinearity

We now consider the standardised coefficients of the Linear Regression model to evaluate which factors are having a significant effect on the model and which are not.

As we can see, clearly the city feature is the most significant, given it has the largest coefficient. We note that both beds and room_type_Non En-Suite are the least significant.

City being the most important is as we expected. Furthermore, age and room_type_Non En-Suite having negative coefficients makes sense, given the 'base' case (a zero in all the OHE room_type columns) is an en-suite and we would expect a non en-suite to be cheaper than an en-suite, were all else fixed.

Below, we consider a correlation matrix for all the features and the independent weekly_rent variable to assess which dependent variables are having a strong linear correlation with weekly_rent and to check for multicollinearity.

We note that, as above, city has the largest correlation with weekly_rent. Furthermore, none of the variables exhibit significant multicollinearity, with the largest correlation being between room_type_Studio and room_type_Non En-Suite at $-0.29$.

6.3.6 Model Iterations

Given the above correlation plot, we will not consider removing any features due to multicollinearity. However, we will consider removing both the beds and room_type_Non En-Suite feature given both their small absolute coefficients and low correlations with weekly_rent to see if that improves the model via simplification.

We start by removing the beds feature.

We now train a new model on this refined data set before evaluating it using the metrics we used previously to decide if it is a more accurate model or not.

We note that this is broadly a very similar outcome to before, showing that beds was having very little impact. This model is marginally inferior to the original, with a slightly higher MSE and a lower $R^{2}$ score, but still generalises well.

We now consider a model with just the room_type_Non En-Suite feature removed.

This model is also marginally inferior to the original model. It has inferior training metrics and marginally inferior test MSE and $R^{2}$, suggesting a similar model that also generalises well, albeit slightly worse. We would take the original Linear Regression model over both of these iterations due to its superior performance.

Other refinements we could employ include regularisation techniques such as ridge or lasso. However, the model currently generalises well to the test data, as evidenced by the comparable $R^{2}$ scores between the train and test sets; the dimensionality is low, thanks to the Target Encoding we employed; and there is no multicollinearity. Regularisation is therefore unlikely to improve the model, and we will not explore the idea in this study.

We now wish to consider other potential regression models to see if they perform better on the data. We note that the residual analysis above suggests the data is fairly homoscedastic and appears to have normally distributed residuals, which is why the Linear Regression model performed well.

However, we wish to see whether some more complex models that can handle non-linear relationships can better model the data. In the next two sections we shall consider two further models - Random Forest and Gradient Boosting. We will compare the models' accuracies and eventually choose one to take forward for refinement.

6.4 Random Forest Model

6.4.1 Training the Model

We start with a Random Forest model. Given that the Linear Regression model we are using is the original model, we will compare to that one by training the Random Forest model on that data set.

We note that whilst a Random Forest model does not require standardised or normally distributed data, it works perfectly well with the transformed data we already have, and so we refrain from re-transforming it.
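The training pattern, including the feature-importance attribute examined shortly, can be sketched on synthetic data; the dataset and hyperparameters below are illustrative.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic regression data standing in for the transformed PBSA features
X, y = make_regression(n_samples=300, n_features=6, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

rf = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_tr, y_tr)

# A train/test R^2 gap of this kind is the tell-tale sign of overfitting
train_r2 = rf.score(X_tr, y_tr)
test_r2 = rf.score(X_te, y_te)

# Importances sum to 1 and rank how much each feature drives the splits
importances = rf.feature_importances_
```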

6.4.2 Evaluating the Model

As we can see, the MSE on both the training and test sets is lower than that of the Linear Regression model. Moreover, with $R^{2}$ scores of $0.9816$ and $0.8386$ for the training and test sets, respectively, this model appears to be predicting the variability in the data to a higher degree than the Linear Regression model. However, we note that the larger drop off between the training set $R^{2}$ score and the test set $R^{2}$ score suggests that this model is overfit to the training data and is not generalising as well to the unseen test data. We may be able to remedy this using some parameter tuning.

First we examine how significant each feature is in this model.

Again, similarly to the Linear Regression model, we see that the city variable is by far the most important. Furthermore, we note that room_type_Non En-Suite is the least important. Perhaps a Random Forest model without this feature would be an improvement.

6.4.3 Residual Analysis

We now consider the residuals of this model. We note that, as a non-linear model, we do not require homoscedasticity, normally distributed residuals, or a linear QQ plot. However, we examine these residuals for completeness and as a point of comparison.

Whilst the residuals are centered around zero, we note that there does appear to be slight heteroscedasticity here, with the variance of the residuals appearing larger as we move along the x-axis. There also seems to be a slight pattern in the residuals, with the residuals increasing as the predicted values do.

Below we plot the distribution of the residuals.

We note that the residuals appear to be broadly normally distributed albeit less so than the residuals for the Linear Regression. Furthermore, there is a rough symmetry here that suggests little bias.

Below we plot the QQ plot for the residuals from this model.

Here we see that whilst there is a linear relationship in the main, suggesting a degree of normality, the tails deviate at both ends. This suggests the tails of the distribution are fatter than that of a normal distribution, unlike when we considered the residuals from the Linear Regression model, which were broadly normal.

6.5 Gradient Boosting Model

6.5.1 Training & Evaluating the Model

We now consider another non-parametric model in Gradient Boosting to see if we can garner superior results to that of the Random Forest. Again, for the point of comparison, we train this model on the original data set.
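For comparability, the Gradient Boosting fit follows the same pattern as the Random Forest sketch; again the synthetic data and default hyperparameters are illustrative.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=300, n_features=6, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Default settings: 100 shallow trees fitted sequentially, each one
# correcting the residual errors of its predecessors
gb = GradientBoostingRegressor(random_state=0).fit(X_tr, y_tr)
train_r2 = gb.score(X_tr, y_tr)
test_r2 = gb.score(X_te, y_te)
```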

We note that this model also appears to outperform the Linear Regression model. Furthermore, the $R^{2}$ score for the training data is lower for the Gradient Boosting model ($0.9287$) than for the Random Forest model ($0.9816$), whilst the $R^{2}$ score for the test data is notably superior ($0.8550$ versus $0.8386$). This suggests that this model generalises better to unseen data, with a smaller drop off, whilst improving accuracy on the test data.

That being said, we should still note that there is a drop off, albeit smaller, between the $R^{2}$ score for the training data and the $R^{2}$ score for the test data, suggesting once again that there is room for improvement to prevent the apparent overfitting to the training set.

We now consider the significance of each feature in this model.

Again, we can see that city is by far the most important feature and that room_type_Non En-Suite is the least. Interestingly, in comparison to the Random Forest Model, the beds feature has a smaller significance in this model and is more comparable to room_type_Non En-Suite.

6.5.2 Residual Analysis

We now consider the residuals of this model, again noting that the stipulations required for the Linear Regression residuals do not apply here. We include them as a point of comparison.

Here the residuals appear to be fairly homoscedastic albeit there are some outliers which may affect the histogram and QQ plots. We now consider the histogram of the residuals.

As expected, given the residuals plot above, the histogram shows a fairly normal distribution, albeit with fatter tails. We now consider the QQ plot for these residuals.

Again we have a broadly linear pattern, with similar deviations at the tails to the Random Forest model. This suggests that this Gradient Boosting model also has a distribution with slightly fatter tails.

6.6 Iterating The Selected Model

The Gradient Boosting model is superior from an $R^{2}$ and MSE perspective to the Random Forest model. Furthermore, the Gradient Boosting model experiences a smaller drop off from training $R^{2}$ to test $R^{2}$, suggesting a better baseline generalisation than the Random Forest model. Although the Linear Regression model is simpler and shows an even better generalisation between train and test sets, its $R^{2}$ score is inferior to both the Random Forest and Gradient Boosting models, and therefore we have elected to select the Gradient Boosting model moving forward. Given this, we now want to consider tuning the model to see if we can achieve marginally better performance.

Below, we use Grid Search to seek the best hyperparameter values for the model, so we can train a superior and more accurate model.
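A Grid Search of this kind might look like the sketch below; the grid values are illustrative choices over the parameters tuned later, not the notebook's actual grid.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

X, y = make_regression(n_samples=200, n_features=5, noise=5.0, random_state=0)

# Illustrative grid over the hyperparameters discussed in the text
param_grid = {
    "n_estimators": [100, 200],
    "max_depth": [2, 3],
    "learning_rate": [0.05, 0.1],
}

# Every combination is scored by 5-fold cross-validated R^2
search = GridSearchCV(
    GradientBoostingRegressor(random_state=0),
    param_grid,
    cv=5,
    scoring="r2",
)
search.fit(X, y)
best_params, best_cv_r2 = search.best_params_, search.best_score_
```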

We can see that the model has a low MSE on the training set and a low MSE on the test set. Moreover, the $R^{2}$ score on both the training and test set has improved. However, we note that the drop off between the two is larger, which suggests the model is overfitting to the training data.

We now use Cross Validation to check the robustness of the model by training it on five different subsamples of the transformed_X_train data set and then testing it using the $R^{2}$ score.
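The cross-validation check follows this pattern; the synthetic data stands in for transformed_X_train.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=200, n_features=5, noise=5.0, random_state=0)

# Five folds: train on four fifths, score R^2 on the held-out fifth, rotate
scores = cross_val_score(
    GradientBoostingRegressor(random_state=0), X, y, cv=5, scoring="r2"
)
mean_cv_r2 = scores.mean()
```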

Again, we note there is a drop from the mean cross validation $R^{2}$ score of $0.8911$ to the test $R^{2}$ score of $0.8740$, further supporting the idea that the model is overfitting, albeit we do note it is not a substantial drop off. However, the Gradient Boosting model is known for being complex and so this overfitting may be a result of that.

Below, we attempt to combat this. We train another Gradient Boosting model where we reduce max_depth in order to reduce complexity, whilst simultaneously reducing the learning_rate to avoid overfitting and increasing n_estimators to compensate and stabilise the model.

As we can see, we have sacrificed some accuracy in both the MSE and $R^{2}$ scores with this version of the model. However, the difference between the training $R^{2}$ score and the test $R^{2}$ score is approximately $0.0633$, which is significantly less than before, albeit more than twice the drop off of the original Linear Regression model. This model nonetheless has higher overall metrics, and a drop off of $0.0633$ is broadly acceptable. Furthermore, this model has a similar test $R^{2}$ score to that of the original Gradient Boosting model, yet benefits from better generalisation. Therefore, we have found a model with decent generalisation and an overall $R^{2}$ score on the test set of around $85\%$.

We note as well that the average cross-validation $R^{2}$ score is closer to the one we are achieving on the test set, further exemplifying this model's ability to generalise.

As well as optimising our model, we could also use the Ensemble Method by averaging the predictions of our optimised Gradient Boosting model and the original Linear Regression model in an attempt to attain both a high test $R^{2}$ score whilst maintaining a good level of generalisation.
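The averaging itself is straightforward; the sketch below fits both models on synthetic stand-in data and averages their test predictions.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=300, n_features=6, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

lr = LinearRegression().fit(X_tr, y_tr)
gb = GradientBoostingRegressor(random_state=0).fit(X_tr, y_tr)

# Simple ensemble: average the two models' predictions equally
ensemble_pred = (lr.predict(X_te) + gb.predict(X_te)) / 2
ensemble_r2 = r2_score(y_te, ensemble_pred)
```

An equal-weight average is the simplest choice; unequal weights or a stacked meta-model are natural extensions if one base model consistently outperforms the other.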

As we can see, this model maintains the overall test $R^{2}$ score around ~$85\%$, and also has superior generalisation, with the model being less overfit to the training set, as exemplified by the smaller gap between the training and test $R^{2}$ score of around $0.0447$. Moreover, we note that the MSE for the test set is not too inferior for this ensemble model in comparison to our optimised Gradient Boosting model.

We now consider the two models. One is our optimised Gradient Boosting model; the other is our ensemble model, which averages the high-scoring optimised Gradient Boosting model with the lower-scoring Linear Regression model, which generalises better to unseen data and is less prone to overfitting.

Whilst using only one model is simpler, averaging the two has a logical foundation: the Gradient Boosting model can capture any non-linear relationships, whilst the original Linear Regression model generalises better and is appropriate given the homoscedasticity of the data. Given this combination of the best of both - a model that is prone to overfitting but can model non-linear trends, and a linear model that generalises well and is reflective of the data - we are going to proceed with the ensemble model.

6.7 Examining the Ensemble Model

For completeness, we examine the residuals, the distribution of the residuals, and a QQ plot.

Here, as expected, we see a fairly good pattern, with the residuals scattered around zero, signs of homoscedasticity, and the two outliers in the top right accounted for through previous explanation.

As expected, the residuals are broadly normal with a symmetry suggesting zero bias.

The QQ plot shows good signs of normality, given we are incorporating a linear model in our ensemble. We note the two obvious outliers in the right tail have been accounted for and there is a slight skew in the left tail but it is not too egregious.

6.8 Conclusion

In this section, we have trained various different models on the transformed data. We began by considering a simple Linear Regression model which proved to be accurate and generalise well. We then considered non-parametric models like the Random Forest and Gradient Boosting. Both provided stronger scores than the Linear Regression model but were clearly prone to overfitting.

After initially deciding to pursue the Gradient Boosting model, we refined the hyperparameters using Grid Search to arrive at a version of the model with a higher $R^{2}$ score, albeit in doing so we sacrificed generality. It was clear that the model was overfitting. In defining a new model with different hyperparameters, we were able to reduce this overfitting and increase the generality, at the cost of a reduced $R^{2}$ score.

We then considered the Ensemble Method, which involves averaging the predictions of multiple models to smooth out any outliers or overfitting. We took both the Linear Regression model and our optimised and high $R^{2}$-scoring Gradient Boosting model and averaged the two. The result was a model with a similar $R^{2}$ score and MSE to the optimised Gradient Boosting model but with a more logical and interpretable origin as well as superior generalisation. Given this, we decided to use this model going forward.

7. Model Evaluation & Interpretation

7.1 Introduction

In this section, we shall evaluate the chosen model across a range of metrics. We will then interpret both the residuals and the feature importances for the model within the real-world context of PBSA rental rates. Finally, we examine what this has told us about the wider PBSA rental market in the UK.

7.2 Ensemble Model Performance

We chose to move forward with an ensemble model that averaged the predictions of a simple Linear Regression model and the more complex Gradient Boosting model. The reasoning was that the two models combined provide a degree of robustness: the Linear Regression is less prone to overfitting, while the Gradient Boosting model can capture non-linear trends more accurately.

Despite our data proving to be fairly normal and the Linear Regression model performing well, we found that the Gradient Boosting model performed better from an $R^{2}$ perspective but was prone to overfitting. Taking an average of the two therefore provides a higher-performing model that generalises better than the Gradient Boosting model alone, and we deemed this worth the cost of the high level of interpretability one has when using just a Linear Regression model.
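The averaging itself is straightforward. A minimal sketch, assuming two fitted scikit-learn estimators (the data and model names below are placeholders, not the project's actual objects):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import GradientBoostingRegressor

# Placeholder data standing in for the project's engineered features/target
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = X @ np.array([0.5, -0.2, 0.3, 0.1]) + rng.normal(scale=0.1, size=200)

lin_model = LinearRegression().fit(X, y)
gb_model = GradientBoostingRegressor(random_state=0).fit(X, y)

def ensemble_predict(X_new):
    """Average the two models' predictions with equal weight."""
    return 0.5 * (lin_model.predict(X_new) + gb_model.predict(X_new))

y_pred = ensemble_predict(X)
```

The same equal-weight averaging can also be obtained with scikit-learn's `VotingRegressor`, which wraps the two estimators in a single fitted object.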

Below we test the ensemble model by considering the MSE, the $R^{2}$ score, and the RMSE as a proportion of both the mean and the range of the target variable.
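The metric calculations follow this pattern (a sketch with synthetic predictions; in the project, the arrays would be the ensemble's train and test predictions):

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Hypothetical targets and predictions standing in for the ensemble's output
rng = np.random.default_rng(1)
y_train = rng.normal(loc=5.0, scale=0.4, size=300)
train_pred = y_train + rng.normal(scale=0.1, size=300)
y_test = rng.normal(loc=5.0, scale=0.4, size=100)
test_pred = y_test + rng.normal(scale=0.15, size=100)

# MSE and R^2 on both the training and test sets
train_mse = mean_squared_error(y_train, train_pred)
test_mse = mean_squared_error(y_test, test_pred)
train_r2 = r2_score(y_train, train_pred)
test_r2 = r2_score(y_test, test_pred)

print(f"Train MSE: {train_mse:.4f}, Test MSE: {test_mse:.4f}")
print(f"Train R^2: {train_r2:.4f}, Test R^2: {test_r2:.4f}")
```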

The MSE for the training set is $0.0143$, which is low and shows that the model fits the training data with a high degree of accuracy. In comparison, the test set has an MSE of $0.0211$ to four decimal places, which is higher than the training set's, as we would expect, but still low. Moreover, the two MSEs being fairly close suggests that the model generalises well and is not overfit to the training data.

Furthermore, the training data has an $R^{2}$ score of $0.8941$, whilst the test set score is $0.8494$. This suggests that ~$85\%$ of the variance in the weekly_rent variable on the test set is explained by our independent variables. The remaining ~$15\%$ may be explained by inherent variation in the variable or by other features for which we do not have data. An $R^{2}$ score of ~$85\%$ is high and suggests that the ensemble model is performing well. Furthermore, real estate rents are typically socially influenced, with specific areas moving in and out of popularity as time progresses, and unaccounted-for aspects, such as marketing and curb appeal, also have an effect. Given this, attaining an $R^{2}$ score of $90\%$ or more is unlikely on such data, given the inherent variation discussed.

We now examine the RMSE.
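The relative-error figures can be computed as follows (again with synthetic stand-ins for the target and predictions):

```python
import numpy as np

# Illustrative target and predictions, not the project's actual arrays
rng = np.random.default_rng(2)
y_test = rng.normal(loc=5.0, scale=0.4, size=100)
test_pred = y_test + rng.normal(scale=0.15, size=100)

# RMSE expressed relative to the target's mean and to its range
rmse = np.sqrt(np.mean((y_test - test_pred) ** 2))
rmse_vs_mean = rmse / y_test.mean()
rmse_vs_range = rmse / (y_test.max() - y_test.min())

print(f"RMSE as % of mean:  {rmse_vs_mean:.2%}")
print(f"RMSE as % of range: {rmse_vs_range:.2%}")
```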

Above, we see that the RMSE represents ~$2.23\%$ of the average weekly_rent for the training data and ~$2.70\%$ for the test data. This is a low relative average error, and lower than for the Linear Regression model alone, suggesting that the model is performing well.

We also note that the RMSE as a proportion of the range of the weekly_rent variable is ~$5.29\%$ for the training data and ~$7.12\%$ for the test data. Again, this represents a small proportion of the target range, suggesting the model's performance is strong.

7.3 Residual Analysis

We now examine the residuals of the test data for the ensemble model.

We note that since we are averaging the Linear Regression model with a model that does not require homoscedasticity, this condition is less significant. However, for completeness, these residuals appear to show some signs of 'fanning out' as we move along the x-axis, a sign of heteroscedasticity.

We now test to see if there are any outliers in this test set, using the test we defined earlier. Again we standardise the residuals, take their absolute values, and compare them to our threshold of three.
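The check amounts to flagging any residual whose absolute z-score exceeds three; a minimal sketch with illustrative residuals standing in for the model's:

```python
import numpy as np

# Illustrative residuals standing in for the ensemble's test-set residuals
rng = np.random.default_rng(4)
residuals = rng.normal(scale=0.15, size=100)

# Absolute z-scores of the residuals, compared against the threshold of three
z = np.abs((residuals - residuals.mean()) / residuals.std())
outlier_idx = np.where(z > 3)[0]

print(f"Observed outliers: {len(outlier_idx)}")
```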

Here we see that there are no observed outliers in the test set under this model. We now consider the distribution of the residuals and the QQ plot.

We note that the distribution has a fatter left tail than a normal distribution, suggesting a slight negative skew. Again, normality is not so important here given we are averaging the Linear Regression Model with a non-linear model.

We examine the QQ plot below.

As the histogram suggested, the deviation below the line at both the left and right tails points to the negative skew we saw. We note that the plot is broadly normal across the majority of the quantiles, however.

This slight negative skew at the tails suggests that the model is prone to over-predicting the value of weekly_rent.

7.4 Feature Analysis

We now consider the importance of each feature. We examine the coefficients of the Linear Regression model that forms part of the ensemble model, as well as considering the feature importance for the optimised Gradient Boosting model, which forms the other part of our ensemble model.
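A sketch of this comparison, assuming a fitted `LinearRegression` and `GradientBoostingRegressor` (the feature names and data below are placeholders for the project's engineered features):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import GradientBoostingRegressor

# Placeholder features/target standing in for the project's engineered data
rng = np.random.default_rng(5)
features = ["city", "room_type_Studio", "room_type_One Bed", "operator"]
X = pd.DataFrame(rng.normal(size=(200, len(features))), columns=features)
y = 0.8 * X["city"] + 0.4 * X["room_type_Studio"] + rng.normal(scale=0.1, size=200)

lin_model = LinearRegression().fit(X, y)
gb_model = GradientBoostingRegressor(random_state=0).fit(X, y)

# Side-by-side view: absolute linear coefficients vs GB feature importances
comparison = pd.DataFrame({
    "lin_coef_abs": np.abs(lin_model.coef_),
    "gb_importance": gb_model.feature_importances_,
}, index=features).sort_values("gb_importance", ascending=False)

print(comparison)
```

Note that the two columns are on different scales: linear coefficients measure the effect per standardised unit of each feature, while the Gradient Boosting importances sum to one, so only the relative orderings are directly comparable.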

As we saw earlier, the most important feature in both models is clearly the city feature, which is twice as important as any other. This is as expected, given that different cities have different demand drivers within the PBSA market, such as the number of universities, the reputation of those universities, the number of students, the current number of PBSA beds, and the resulting supply-demand metrics. Moreover, even within a wider real estate context, different cities have varying degrees of competing land uses. For example, prime London real estate could be commercial or residential, and developers need to see a certain amount of profit for it to be worth building PBSA. Therefore, rental rates in London need to be higher to ensure that a developer achieves an appropriate yield on cost for their development.

The next set of features have similar importance in both models, albeit in slightly different orders: room_type_Studio, room_type_One Bed, and operator. In both models, room_type_Studio is the most important of the three. This makes sense: compared to the base category of an en-suite (given the OHE technique), a studio should garner significantly more rent due to the privacy, sole use of a kitchen, and increased size it offers the occupant. In both models, room_type_One Bed was nearly as important, for similar reasons regarding the superior product being sold.

The Gradient Boosting model considered operator to be more important than room_type_One Bed, whereas the Linear Regression model had it as less significant, with a coefficient of $0.2278$ in comparison to $0.3185$. Either way, both models ascribe importance to it. Contextually, this also makes sense, given that more premium operators, with a higher standard of service, are able to attain higher rents. Moreover, premium operators tend to operate more premium buildings with greater amenities, as this is all part of the premium offering. We have not accounted explicitly for amenity quality amongst assets, but it is logical that an asset with a greater amenity offering could charge higher rents than one with less, all else being equal. However, operator having less importance than city or room_type_Studio makes sense: irrespective of how premium an operator is, it will be somewhat curtailed by the market dynamics of the city the asset is in and by the nature of the product it is selling. Students are unlikely to pay studio-level rents for an en-suite just because it comes from a better operator, as a studio is simply a far superior product. Moreover, the best operator in Carlisle is going to struggle to charge rents higher than a mid-level operator in central London, due to the nature of the sub-markets the assets are in.

The next most important factor in both models was the age of the asset. We would expect some correlation between age and weekly_rent, as it makes logical sense that newer buildings have better amenity and more curb appeal and can therefore charge higher rents. However, there are more factors at play here. As PBSA surged in popularity as an asset class, developers sought to follow this popularity, resulting in the PBSA building boom witnessed between 2014 and 2020, as seen in Section 3.3. This surge resulted in multiple tall buildings being built in city centres, becoming hubs for students. Naturally, local residents may not always favour their area becoming a student hub, creating pressure on local governments to restrict the development of PBSA. As a result, councils have made the development of PBSA more difficult, the notable example being the London Plan, with certain London Boroughs, such as Southwark, requiring large affordable-housing contributions that make it more difficult still to build. This has caused somewhat of a slowdown in PBSA development, further exacerbated by the fact that the most obvious and best sites in proximity to universities were largely developed during the early building boom. This leaves less appealing sites, which may not be viable given stringent local government measures. As a result, whilst newer buildings would typically be expected to achieve higher rents, this effect is somewhat offset by older assets tending to be in more prime locations within their local sub-markets, and we have already seen how significant location is with regards to rents, as evidenced by the macro-location city feature.

The final two features are beds and room_type_Non En-suite. The Linear Regression model considers beds half as significant as room_type_Non En-suite, whereas the Gradient Boosting model considers them equally unimportant, with room_type_Non En-suite marginally less so. Either way, both contribute little to the model. With regards to beds, this makes sense: a larger asset may have more amenity space, but that space is shared with more people, somewhat diluting the effect. Moreover, most students probably don't pay much attention to the size of the asset unless it is at the extreme end of the spectrum, meaning that for the majority of assets, beds has little impact. With regards to room_type_Non En-suite, this probably stems from what our box plots in Section 3.3 highlighted: there doesn't appear to be that big a difference between the en-suite and non en-suite room types from a weekly_rent perspective. Whilst we would expect it to have a slightly negative effect (and it does), it appears not to be as strong as one might think. The gap between a studio and an en-suite is more significant than that between an en-suite and a non en-suite, suggesting that students value not having to share a kitchen more than not having to share a bathroom. Alternatively, the decider could be the extra tangible space a studio offers, given that en-suites and non en-suites feel a similar size, the en-suite bathroom being a separate room and therefore less visible and tangible.

7.5 Conclusion

In conclusion, we have found that the ensemble model can predict PBSA rental rates from the features provided with a strong degree of accuracy, with an $R^{2}$ score of approximately $85\%$. Moreover, by far the most important feature for predicting PBSA rental rates is the city, which makes sense as sub-market dynamics drastically influence rental rates across all real estate asset classes. We also learned that the age of a building is less significant than one would imagine, with various reasons posited as to why. Finally, we found that the size of an asset contributes little to the rents that can be achieved, and that the difference between non en-suites and en-suites is not considered especially significant by our models either.

8. Conclusion

In this project we have examined using machine learning methods to model the rental rates achieved in the PBSA market.

We started by importing the data and cleaning it, deciding on the features we deemed important enough to include in our model. We then performed EDA to ascertain the distribution of our data, which would inform the transformations we made later, and remedied any data points considered outliers through a mixture of corrections and removals. The first model we wanted to try was Linear Regression, due to the log-normality of the target variable and the simplicity of the model, so we showed that the data could indeed be transformed to be suitable for such a model. Having established this, we moved on to Section 5, where we performed the feature engineering, getting our data ready for modelling through a series of transformations and standardisations. Following this, we considered three different types of models: the aforementioned Linear Regression, as well as two non-linear models in Random Forest and Gradient Boosting. After comparing the models, we decided to use a Gradient Boosting model, which we iterated to improve. We then considered a further development by averaging the predictions of the optimised Gradient Boosting model with the original Linear Regression model to create an ensemble model, which was selected going forward. Finally, we evaluated this model and interpreted the results in the context of the PBSA rental market.

We found that PBSA rental data could be modelled accurately using this model, with an MSE of $0.0211$ and an $R^{2}$ score of $0.8494$ on the test set; the model only lost ~$4.5\%$ from the training $R^{2}$ score to the test $R^{2}$ score, evidencing that it generalised well to unseen data. The residual analysis on the test data suggested our residuals were slightly negatively skewed, which was not a major concern given that our ensemble approach incorporated a non-linear model in Gradient Boosting, relaxing the normality requirement that a simple Linear Regression model alone would carry.

This model showed that by far the most important feature was the city. Given every city has a different market, this was the distinguishing factor. Other important features included studio rooms, one-bed rooms, and the operator, whilst the remaining factors had less significant influence. This makes sense, as studios and one-beds are a notably superior product to en-suites, and the quality of operator and the level of service provided vary wildly across the PBSA landscape. To achieve the highest rents, one would need to be in London, offering a premium service on an asset with studios and one-beds only. However, this is only part of the equation: operational and development costs in London are significantly higher, and a studio-only asset provides fewer rooms than a cluster-led approach. Whilst higher rents may be achieved on a room-by-room basis, the bottom line of profitability has not been accounted for here. As we see in reality, there is profit to be found across the spectrum of cities, operator quality, and asset mix in the PBSA space, and there is no blanket one-size-fits-all approach.

Going forward, there are several next steps. Access to larger amounts of data, such as room sizes, could prove influential and well worth exploring, were the data available and accurate. Moreover, on several occasions we found that the sub-market dynamics within a city were more influential than one would imagine. This is particularly prevalent in London but is likely to be the case across the UK. With this in mind, a future study could use the removed postcode data to determine which assets are close to one another and to important amenities, in an attempt to explore the sub-market dynamics of the PBSA space. This would serve as an excellent complement to this study, which sought to model PBSA rents on a UK-wide level and achieved this with a respectable degree of success.